feat: add Confluence metadata extractor by ravisuhag · Pull Request #515 · raystack/meteor

ravisuhag · 2026-04-18T22:36:44Z

Summary

- Adds a new Confluence extractor that extracts page metadata and relationships from Confluence spaces via the REST API v2.
- Emits `space` and `document` entities with `belongs_to`, `child_of`, `owned_by`, and `documented_by` edges.
- Scans page content for URN references to auto-link documentation to data assets.
- Supports filtering by space keys and excluding specific spaces.

Details

New files:

- `plugins/extractors/confluence/confluence.go`: Main extractor with Config, Init, Extract
- `plugins/extractors/confluence/client.go`: HTTP client for Confluence REST API v2 (spaces, pages, labels, cursor-based pagination)
- `plugins/extractors/confluence/confluence_test.go`: 6 unit tests covering config validation, extraction, edges, URN detection, exclusion
- `plugins/extractors/confluence/README.md`: Documentation
- `test/e2e/confluence_file/confluence_file_test.go`: End-to-end test with mock server through full pipeline

Entities emitted:

| Type | Description |
| --- | --- |
| `space` | Confluence space metadata |
| `document` | Page metadata (title, labels, version, timestamps) |

Edges emitted:

| Type | Source → Target | Description |
| --- | --- | --- |
| `belongs_to` | document → space | Page belongs to a space |
| `child_of` | document → document | Page hierarchy |
| `owned_by` | document → user | Page author |
| `documented_by` | document → any | URN references found in page content |

Closes #503 (Confluence portion)

Test plan

- Unit tests pass (`go test -tags plugins ./plugins/extractors/confluence/`)
- E2E test passes (`go test -tags integration ./test/e2e/confluence_file/`)
- `go build ./...` succeeds
- Review edge types and entity properties for consistency with existing extractors

vercel · 2026-04-18T22:36:49Z

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| meteor | Ready | Preview, Comment | Apr 18, 2026 10:47pm |

coderabbitai · 2026-04-18T22:36:58Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2d8c9d6b-4c8a-46b1-83d4-cddae3e801d8

📥 Commits

Reviewing files that changed from the base of the PR and between fca4a24 and a41cb6b.

📒 Files selected for processing (3)

plugins/extractors/confluence/client.go
plugins/extractors/confluence/confluence.go
plugins/extractors/confluence/confluence_test.go

📝 Walkthrough

A new Confluence extractor plugin was added to extract metadata and relationships from Confluence spaces and pages via the REST API v2. The implementation includes a REST client (client.go) for API interactions, core extractor logic (confluence.go) that retrieves spaces and pages, extracts document metadata, detects embedded URN references through regex scanning, and emits space and document records with relationship edges (belongs_to, child_of, owned_by, documented_by). Supporting tests and documentation were also added, along with plugin registration in the extractors populate file.

Sequence Diagram

sequenceDiagram
    participant E as Extract Flow
    participant C as Confluence Client
    participant API as Confluence REST API v2
    participant Emit as Record Emitter

    E->>C: GetSpaces(ctx, keys)
    C->>API: GET /spaces (with pagination via cursor)
    API-->>C: Spaces list
    C-->>E: []Space

    loop For each space (not excluded)
        E->>Emit: Emit space record
        E->>C: GetPages(ctx, spaceID)
        C->>API: GET /spaces/{id}/pages (cursor pagination, storage format)
        API-->>C: Pages list
        C-->>E: []Page

        loop For each page
            E->>C: GetPageLabels(ctx, pageID)
            C->>API: GET /pages/{id}/labels
            API-->>C: Labels
            C-->>E: []Label
            
            E->>E: Extract metadata, scan body for URNs
            E->>Emit: Emit document record
            E->>Emit: Emit belongs_to edge (space)
            E->>Emit: Emit child_of edge (parent page if exists)
            E->>Emit: Emit owned_by edge (author)
            E->>Emit: Emit documented_by edges (per detected URN)
        end
    end

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: add Confluence metadata extractor' clearly describes the main change: adding a new Confluence extractor for metadata extraction. |
| Description check | ✅ Passed | The description provides comprehensive details about the new Confluence extractor, including objectives, entities, edges, test coverage, and linked issues. |
| Linked Issues check | ✅ Passed | The PR fully implements all coding requirements from issue `#503`: extracts page metadata (title, space, author, labels), page hierarchy relationships, emits documented_by edges via URN scanning, and supports space filtering/exclusion. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing the Confluence extractor specified in issue `#503`. The plugin registration, client implementation, extractor logic, and comprehensive tests are all in-scope. |

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/extractors/confluence/client.go`:
- Around line 152-159: GetPageLabels currently only fetches the first page;
implement cursor-based pagination like GetSpaces/GetPages by looping until there
is no next cursor. Change GetPageLabels to call c.get repeatedly with query
params limit and cursor (or follow the returned _links.next), accumulate
resp.Results into a single slice, and update the local resp struct to include
the pagination metadata (e.g., a Links or _links.next field) so you can extract
the next cursor; ensure errors from c.get are wrapped as before and return the
full aggregated []Label when done.

In `@plugins/extractors/confluence/confluence_test.go`:
- Around line 149-155: The test currently loops over records := emitter.Get()
and only asserts each found "space_key" is not "ARCHIVE", which silently passes
if no records are emitted; update the test to first assert that records is not
empty (e.g., assert.NotEmpty or assert.Greater(len(records), 0) on records
returned by emitter.Get()), then iterate the records from emitter.Get() and
assert that at least one record's props["space_key"] is present and not equal to
"ARCHIVE" (set a found flag while inspecting r.Entity().GetProperties().AsMap()
and assert the flag is true). Ensure you reference the same symbols
(emitter.Get(), records, r.Entity().GetProperties().AsMap(), "space_key") when
making these assertions so the test fails if extraction or filtering removes all
spaces.

In `@plugins/extractors/confluence/confluence.go`:
- Around line 90-95: The current code ignores errors from the emit callback and
downgrades e.extractPages failures to warnings; update the logic so emit
failures and page extraction errors are propagated up instead of suppressed:
check the return value from emit(e.buildSpaceRecord(space)) and return that
error if non-nil, and if e.extractPages(ctx, emit, space) returns an error
return it (don’t just log a warning). Apply the same change where pages are
emitted (the other occurrence noted around line 114) so both emit calls and all
e.extractPages failures bubble up to the caller.
- Around line 148-155: The timestamp formatting in the props map uses the
literal layout "2006-01-02T15:04:05Z" for page.CreatedAt and
page.Version.CreatedAt which forces a literal 'Z' instead of emitting proper
timezone offsets; update those calls to use time.RFC3339 (e.g.,
page.CreatedAt.Format(time.RFC3339) and
page.Version.CreatedAt.Format(time.RFC3339)) and add the missing import "time".
Ensure the changes are applied where props is constructed (referencing
page.CreatedAt and page.Version.CreatedAt) so timestamps include correct
timezone information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e3d9b77-cb68-48d4-b786-32d81029a1cd

📥 Commits

Reviewing files that changed from the base of the PR and between 84a63a2 and fca4a24.

📒 Files selected for processing (6)

plugins/extractors/confluence/README.md
plugins/extractors/confluence/client.go
plugins/extractors/confluence/confluence.go
plugins/extractors/confluence/confluence_test.go
plugins/extractors/populate.go
test/e2e/confluence_file/confluence_file_test.go

coderabbitai · 2026-04-18T22:43:02Z

+		emit(e.buildSpaceRecord(space))
+
+		if err := e.extractPages(ctx, emit, space); err != nil {
+			e.logger.Warn("failed to extract pages from space, skipping",
+				"space", space.Key, "error", err)
+		}


⚠️ Potential issue | 🟠 Major

Propagate emit and page extraction failures.

Line 90 and Line 114 ignore emitter failures, and Lines 92-95 downgrade page extraction failures to a warning. That can make a run succeed with missing records or failed downstream writes.

Proposed error propagation

-		emit(e.buildSpaceRecord(space))
+		if err := emit(e.buildSpaceRecord(space)); err != nil {
+			return fmt.Errorf("emit space %s: %w", space.Key, err)
+		}
 		if err := e.extractPages(ctx, emit, space); err != nil {
-			e.logger.Warn("failed to extract pages from space, skipping",
-				"space", space.Key, "error", err)
+			return fmt.Errorf("extract pages from space %s: %w", space.Key, err)
 		}

-		emit(e.buildPageRecord(page, space, spaceURN, labels))
+		if err := emit(e.buildPageRecord(page, space, spaceURN, labels)); err != nil {
+			return fmt.Errorf("emit page %s: %w", page.ID, err)
+		}

Also applies to: 114-114

Not applicable — plugins.Emit is defined as func(models.Record) with no return value, so there's no error to propagate. The warn-and-skip pattern for page extraction matches other extractors (e.g. GitHub).

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the Emit type definition
rg -n "type Emit" --type go

# Check GitHub extractor for warn-and-skip pattern
rg -n "Warn.*extract.*skip\|Warn.*pages\|extractPages" plugins/extractors/github/ -A2 2>/dev/null || true
rg -n "logger.Warn.*skip" plugins/extractors/ --type go -A2 | head -40

Length of output: 2553

@ravisuhag, you're absolutely right — confirmed on both points:

plugins.Emit signature: plugins/plugin.go Line 19 defines it as type Emit func(models.Record) — no return value, so there's nothing to propagate.

Warn-and-skip pattern: The GitHub extractor uses the same pattern in multiple places (e.g., Lines 151, 256, 331, 345, 386, 391, 451 of plugins/extractors/github/github.go), making the Confluence extractor's approach consistent with the established convention.

My review comment was incorrect. Sorry for the noise!

✏️ Learnings added

Learnt from: ravisuhag
Repo: raystack/meteor
PR: 515
File: plugins/extractors/confluence/confluence.go:91-96
Timestamp: 2026-04-18T22:46:54.415Z
Learning: In the raystack/meteor codebase, `plugins.Emit` is defined as `func(models.Record)` (no return value) in `plugins/plugin.go`. There is no error to propagate from emit calls. The warn-and-skip pattern (`e.logger.Warn(...)` + `continue`) for non-fatal per-item errors in extractors is the established convention, as used in `plugins/extractors/github/github.go` and others.

🧠 Learnings used

Learnt from: CR
Repo: raystack/meteor
PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-18T03:36:16.367Z
Learning: Applies to plugins/extractors/**/*.go : Extractors should emit Records containing Entity (urn, type, name, description, source, properties) and Edges (source_urn, target_urn, type, source, properties) for relationships

vercel Bot deployed to Preview April 18, 2026 22:38

coderabbitai Bot reviewed Apr 18, 2026

fix: address review feedback on confluence extractor

a41cb6b

- Paginate GetPageLabels to capture all labels (not just first page)
- Use time.RFC3339 for proper timezone handling in timestamps
- Tighten exclusion test to assert ENG space exists (not just absence of ARCHIVE)
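
The tightened exclusion check can be sketched as a plain Go helper. This is an illustration under assumptions, not the test in `confluence_test.go`: `checkExclusion` is a hypothetical name, and records are modelled as property maps keyed by `space_key` rather than the real record type.

```go
package main

import "fmt"

// checkExclusion returns an error unless the records are non-empty, contain
// the expected space, and do not contain the excluded one. Checking for the
// expected space keeps the assertion from passing vacuously when nothing was
// emitted at all.
func checkExclusion(records []map[string]any, excluded, expected string) error {
	if len(records) == 0 {
		return fmt.Errorf("no records emitted")
	}
	foundExpected := false
	for _, props := range records {
		key, ok := props["space_key"].(string)
		if !ok {
			continue // record carries no space_key property
		}
		if key == excluded {
			return fmt.Errorf("excluded space %q was emitted", excluded)
		}
		if key == expected {
			foundExpected = true
		}
	}
	if !foundExpected {
		return fmt.Errorf("expected space %q was not emitted", expected)
	}
	return nil
}

func main() {
	records := []map[string]any{
		{"space_key": "ENG"},
		{"title": "Runbook"}, // no space_key on this record
	}
	fmt.Println(checkExclusion(records, "ARCHIVE", "ENG")) // <nil>
	fmt.Println(checkExclusion(nil, "ARCHIVE", "ENG"))     // no records emitted
}
```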

vercel Bot deployed to Preview April 18, 2026 22:47

ravisuhag merged commit b560a76 into main Apr 18, 2026
55 checks passed

ravisuhag deleted the feat/confluence-extractor branch April 18, 2026 22:50

ravisuhag merged 2 commits into main from feat/confluence-extractor
